From Messy to Meaningful: Integrating Big Data in Palaeoecology

Gavin Simpson

Aarhus University

2026-01-07

What are our data?

What does AI think?

Source: Sora

…but can we trust AI?

Source: Sora

What are our data?

Palaeo data come in many forms

  • counts
  • %
  • biomass
  • concentrations
  • presence / absence (?)
  • presence only?
  • whatever the heck metagenomics / metabarcoding produces

Is this a problem?

Probably not

  • in a single site study
  • in a curated study of selected sites

Becomes a problem when we start to collate data into a large database

Why is this a problem?

Even within a single proxy we have inconsistent data

  • raw data may not be available for some locations, only %
  • everyone counts things differently
  • everyone uses their own taxonomy

Where we’ve sampled is almost surely not random: it is a convenience sample

Almost surely irregularly spaced in time

So much for old news…

No one in this room needs to be told any of this

Source: Sora

What do we want to do with them?

What do we want to do?

Themes from a brief search of literature (AKA things I’ve been involved with)

  • conservation
  • diversity
  • resilience / sensitivity to change
  • ecological interactions
  • co-occurrence among proxies
  • aquatic – terrestrial linkages

Long term changes in spatial organisation and distribution of species and ecosystems

How might we do it?

Traditional methods used in palaeo are unlikely to help with analyses that compare across more than two taxonomic groups or data of different types

  • co-inertia / co-correspondence analysis: pairs of data sets only

What do we do if we have different resolution data within a proxy?

Or different data representations?

Or different proxies at different sites / samples?

Can we use all the data?

Maslow’s hammer

Source: Sora

Source: Jeff Steinke

Newer methods

Lots of developments in the statistical ecology and omics worlds we can take advantage of

  • integrated SDMs

  • joint species distribution models

  • model-based ordination

  • copula models (marginal models for multivariate responses)

  • graphical models & networks

Not just for the sake of being novel

Newer methods enable estimation of new quantities — new/better answers to questions

Integrated SDMs

Integrated species distribution models

General way to combine — integrate — disparate data

  1. species’ distributions are aggregated spatial locations of all individuals of the same species across a geographical domain

  2. the distribution can be described by a spatial point process, where local intensity (density) of individuals varies

  3. SDMs are a direct or indirect model of this underlying point process

  4. data integration requires linking each data source to the common underlying point process while accounting for differences among data types

What is a point process?

A spatial point process describes the distribution of event locations across some spatial domain

Random process generating points, described by the local intensity \(\lambda_{s}\)

\(\lambda_{s}\) — expected density of points at spatial location \(s\)

If points are independent and the number of points in any region follows a Poisson distribution with constant intensity (\(\lambda_{s} = \lambda \; \forall \; s\)), we have a homogeneous Poisson process

If \(\lambda_{s}\) varies across \(s\), we have an inhomogeneous Poisson process

Other distributions are available

These work in time as well
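A minimal sketch of an inhomogeneous Poisson process on the unit square, simulated by thinning a homogeneous process. The intensity function (density rising from west to east) and the helper names are hypothetical, chosen only for illustration.

```python
import random


def poisson_draw(mean, rng):
    """Draw a Poisson count: unit-rate exponential arrivals landing before `mean`."""
    count, total = 0, rng.expovariate(1.0)
    while total < mean:
        count += 1
        total += rng.expovariate(1.0)
    return count


def simulate_ipp(intensity, lam_max, rng):
    """Simulate points on the unit square by thinning.

    `intensity(x, y)` must never exceed `lam_max`.
    """
    n_candidates = poisson_draw(lam_max, rng)  # unit square has area 1
    points = []
    for _ in range(n_candidates):
        x, y = rng.random(), rng.random()
        # Keep each candidate with probability lambda_s / lam_max
        if rng.random() < intensity(x, y) / lam_max:
            points.append((x, y))
    return points


def intensity(x, y):
    # Hypothetical intensity: expected density increases from x = 0 to x = 1
    return 400.0 * x


rng = random.Random(1)
points = simulate_ipp(intensity, lam_max=400.0, rng=rng)
east = sum(1 for x, _ in points if x > 0.5)
west = len(points) - east
print(len(points), east, west)
```

With this intensity the expected total is 200 points, about three quarters of them in the eastern half.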

Miller et al (2019). Methods Ecol. Evol. 10.1111/2041-210X.13110

Joint likelihood

The different data sets have their own “model” and the likelihoods are combined during fitting

Allows mixing of different types of data

  • pointedSDMs 📦
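A rough sketch of the joint-likelihood idea, not the pointedSDMs implementation. Two simulated data sources share one log-linear intensity \(\lambda = \exp(\beta_0 + \beta_1 x)\): counts are Poisson with mean \(\lambda\), and presence/absence records from a cell of area \(a\) have probability of presence \(1 - \exp(-a\lambda)\). All parameter values, the cell area and the grid search are assumptions made for illustration.

```python
import math
import random

rng = random.Random(42)
TRUE_B0, TRUE_B1 = 0.5, 1.0  # hypothetical values, used only to simulate data
CELL_AREA = 0.2              # hypothetical area behind each presence/absence record


def poisson_draw(mean):
    """Draw a Poisson count: unit-rate exponential arrivals landing before `mean`."""
    count, total = 0, rng.expovariate(1.0)
    while total < mean:
        count += 1
        total += rng.expovariate(1.0)
    return count


def intensity(b0, b1, x):
    return math.exp(b0 + b1 * x)


# Data source 1: counts per unit area at sites with covariate x
count_data = []
for _ in range(150):
    x = rng.uniform(-1, 1)
    count_data.append((x, poisson_draw(intensity(TRUE_B0, TRUE_B1, x))))

# Data source 2: presence/absence at a different set of sites
pa_data = []
for _ in range(150):
    x = rng.uniform(-1, 1)
    p_present = 1 - math.exp(-CELL_AREA * intensity(TRUE_B0, TRUE_B1, x))
    pa_data.append((x, 1 if rng.random() < p_present else 0))


def loglik_counts(b0, b1):
    # Poisson log-likelihood, dropping the constant log(y!) term
    return sum(y * (b0 + b1 * x) - intensity(b0, b1, x) for x, y in count_data)


def loglik_pa(b0, b1):
    total = 0.0
    for x, z in pa_data:
        mu = CELL_AREA * intensity(b0, b1, x)
        # log P(present) = log(1 - exp(-mu)); log P(absent) = -mu
        total += math.log(-math.expm1(-mu)) if z else -mu
    return total


def joint_loglik(b0, b1):
    # Each data type keeps its own likelihood; they are summed on the log scale
    return loglik_counts(b0, b1) + loglik_pa(b0, b1)


# Crude grid search in place of a proper optimiser
grid_b0 = [-1 + 0.1 * i for i in range(31)]  # -1 to 2
grid_b1 = [-1 + 0.1 * i for i in range(41)]  # -1 to 3
b0_hat, b1_hat = max(
    ((b0, b1) for b0 in grid_b0 for b1 in grid_b1),
    key=lambda b: joint_loglik(*b),
)
print(round(b0_hat, 1), round(b1_hat, 1))
```

Both sources inform the same \(\beta_0\) and \(\beta_1\); only the observation model differs between them.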

Similar idea to combine likelihoods from different types of data

  • Jim Clark’s generalized joint attribute model (GJAM) in gjam 📦
  • gfam() family in Simon Wood’s mgcv 📦
  • Dovers, Popovic, & Warton (2023, MEE) scampr 📦

Miller et al (2019). Methods Ecol. Evol. 10.1111/2041-210X.13110

JSDMs

Instead of modelling one species at a time and stacking the models, Joint Species Distribution Models estimate all species at once

Ideally we’d combine integrated SDMs with JSDMs, but I’m not aware of much work on this yet (but see Gelfand & Schliep, 2025)

JSDMs can be used to fit model-based ordinations — might have to move away from traditional ordination methods to handle features of our data properly

  • gllvm
  • ecoCopula
  • boral
  • mvgam

We don’t have to repeat everything that our non-palaeo ecologist colleagues have worked through already — jump to the head of the line

Change of support

What if we don’t have the same proxies measured at the same set of sites? — spatial misalignment

What if proxies represent different amounts of space (time)?

This is covered under the problem of change of support and the concept of data fusion

Computation

For larger data sets, computation using these newer methods becomes difficult

  • more parameters vs. simplifications & approximations

  • newer methods & algorithms, GPUs, etc are helping with this

Most methods are demonstrated with 10s of taxa; the Galore pollen & spore data set (Pound & O’Keefe, 2025, Palynology) has >1000 genera

Description

Even describing larger data sets presents challenges

Want lower dimensional view of the data — topic models

  • Summarise each sample as being made up of proportions of \(A \ll m\) “associations” (don’t hate on me!)

  • “associations” are learned from the data — proportions of each taxon in each “association”

  • each individual in a sample is modelled as a draw from the distribution of “associations” and then a draw of a taxon from that “association”

  • display (model?) data using these \(A \ll m\) “associations”

Latent Dirichlet Allocation
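The generative story described above can be sketched as a simulation. This is not a fitting routine: the “associations” are given rather than learned, and the taxa, association names and probabilities are all hypothetical.

```python
import random

rng = random.Random(7)
TAXA = ["Pinus", "Betula", "Quercus", "Poaceae", "Artemisia"]  # hypothetical taxa

# Hypothetical "associations": each is a probability distribution over taxa
ASSOCIATIONS = {
    "woodland": [0.40, 0.30, 0.25, 0.04, 0.01],
    "steppe": [0.05, 0.05, 0.00, 0.50, 0.40],
}


def dirichlet(alpha):
    """Draw from a Dirichlet distribution by normalising gamma variates."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]


def simulate_sample(n_individuals, alpha=(1.0, 1.0)):
    """Simulate one sample: mixing proportions first, then individuals."""
    names = list(ASSOCIATIONS)
    mix = dirichlet(alpha)  # proportion of each association in this sample
    counts = dict.fromkeys(TAXA, 0)
    for _ in range(n_individuals):
        # Draw an association, then draw a taxon from that association
        assoc = rng.choices(names, weights=mix)[0]
        taxon = rng.choices(TAXA, weights=ASSOCIATIONS[assoc])[0]
        counts[taxon] += 1
    return dict(zip(names, mix)), counts


mix, counts = simulate_sample(300)
print(mix)
print(counts)
```

Fitting a topic model reverses this: given only the counts, estimate both the associations and each sample’s mixing proportions.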

Source: Giphy

Summary

Working with large, disparate, heterogeneous data sets is hard

Using newer statistical approaches is essential to handle this heterogeneity

Some progress has been made — more happening all the time

  • Data integration
  • Data fusion
  • Joint models
  • Dimension reduction (e.g. topic models)

Thank you

Slide deck on GitHub

Extra

Diversity

Diversity metrics are very non-Gaussian

Any modelling of “diversity” needs to handle the sediment accumulation problem

Time averaging different amounts of time per sample leads to

  • heteroscedasticity
  • different effort — biases species richness etc

Same problem affects any modelling of any palaeo data, save for annually laminated records…

Effort problems plague “microbiome”-type data
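A small simulation of the effort problem, using the number of individuals counted as a stand-in for effort. The community is fixed throughout, so any change in richness reflects sampling alone. The community structure and count sizes are hypothetical.

```python
import random

rng = random.Random(3)
N_TAXA = 60
# Hypothetical community: relative abundance of taxon i proportional to 1 / (i + 1)
weights = [1.0 / (i + 1) for i in range(N_TAXA)]


def observed_richness(n_counted):
    """Number of distinct taxa seen when `n_counted` individuals are identified."""
    return len(set(rng.choices(range(N_TAXA), weights=weights, k=n_counted)))


def mean_richness(n_counted, n_reps=200):
    """Average observed richness over repeated samples of the same community."""
    return sum(observed_richness(n_counted) for _ in range(n_reps)) / n_reps


richness = {n: mean_richness(n) for n in (50, 200, 800)}
for n_counted, value in richness.items():
    print(n_counted, round(value, 1))
```

Observed richness rises with the count size even though the underlying community never changes, which is why effort has to enter the model.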

Diversity

Rare or data-deficient species?

Large training sets — throw out rare species, singletons etc

eDNA — “filtering” throws away a lot of data (& please don’t rarefy to counts)

Hierarchical models involving random “effects” allow us to borrow strength from more data-rich taxa
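A toy illustration of borrowing strength, standing in for a real random-effects model. Each taxon’s detection rate is shrunk towards the pooled rate, with the pull strongest where data are scarce. The records and the fixed prior strength are hypothetical; a hierarchical model would estimate the amount of shrinkage from the data.

```python
# Hypothetical records: taxon -> (samples with the taxon, samples examined)
records = {
    "common_a": (120, 400),
    "common_b": (90, 300),
    "rare_c": (1, 2),
    "rare_d": (0, 3),
}
PRIOR_STRENGTH = 10.0  # hypothetical: the pooled rate counts as 10 pseudo-samples

total_detections = sum(d for d, _ in records.values())
total_samples = sum(n for _, n in records.values())
pooled_rate = total_detections / total_samples


def shrunken_rate(detections, n_samples):
    """Weighted average of a taxon's own rate and the pooled rate."""
    return (detections + PRIOR_STRENGTH * pooled_rate) / (n_samples + PRIOR_STRENGTH)


estimates = {}
for taxon, (detections, n_samples) in records.items():
    raw = detections / n_samples
    estimates[taxon] = (raw, shrunken_rate(detections, n_samples))
    print(taxon, round(raw, 3), round(estimates[taxon][1], 3))
```

The data-rich taxa barely move, while the two rare taxa are pulled well away from their raw rates of 0.5 and 0.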

Sharma et al (in press). No species left behind: borrowing strength to map data-deficient species. Trends Ecol. Evol. 10.1016/j.tree.2025.04.010

Omics

Over in the Omics cinematic universe, those folks are doing their own thing integrating disparate kinds of data

Popular techniques are focused around extensions to PLS

Multiple different types of omics analysis on the same samples

What if we can’t?

If we can’t / don’t want to use these newer methods, what can we do with dissimilarities?

Fused dissimilarities

  • compute dissimilarity among samples for a single proxy / type of data separately
  • compute the fused dissimilarity \[d_{\text{fused}_{jk}} = w d_{x_{jk}} + (1 - w)d_{y_{jk}}\]
  • extends to \(\mathcal{N}\) different data sets \[d_{\text{fused}_{jk}} = \sum_{i = 1}^{\mathcal{N}} w_i d_{i_{jk}}, \;\; \text{where} \sum_{i=1}^{\mathcal{N}} w_i = 1\]

Then analyse using NMDS or db-RDA, etc.
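A minimal sketch of the fused dissimilarity for two data types. The choice of Bray-Curtis for counts and Jaccard for presence/absence, the equal weights, and the toy data are all assumptions for illustration.

```python
def bray_curtis(a, b):
    """Bray-Curtis dissimilarity between two abundance vectors."""
    return sum(abs(x - y) for x, y in zip(a, b)) / sum(x + y for x, y in zip(a, b))


def jaccard(a, b):
    """Jaccard dissimilarity between two presence/absence vectors."""
    shared = sum(1 for x, y in zip(a, b) if x and y)
    either = sum(1 for x, y in zip(a, b) if x or y)
    return 1 - shared / either


def dissimilarity_matrix(samples, metric):
    """Square matrix of pairwise dissimilarities among samples."""
    n = len(samples)
    return [
        [0.0 if j == k else metric(samples[j], samples[k]) for k in range(n)]
        for j in range(n)
    ]


def fuse(matrices, weights):
    """Weighted sum of dissimilarity matrices; weights must sum to 1."""
    if abs(sum(weights) - 1.0) > 1e-9:
        raise ValueError("weights must sum to 1")
    n = len(matrices[0])
    return [
        [sum(w * m[j][k] for w, m in zip(weights, matrices)) for k in range(n)]
        for j in range(n)
    ]


# Hypothetical data: the same three samples described by two proxies
pollen_counts = [[10, 0, 5], [8, 2, 5], [0, 12, 3]]
diatom_presence = [[1, 1, 0, 0], [1, 0, 0, 1], [0, 0, 1, 1]]

d_pollen = dissimilarity_matrix(pollen_counts, bray_curtis)
d_diatoms = dissimilarity_matrix(diatom_presence, jaccard)
d_fused = fuse([d_pollen, d_diatoms], weights=[0.5, 0.5])
print(d_fused)
```

Both coefficients here are bounded between 0 and 1, so the weights are directly comparable; coefficients on different scales would probably need rescaling before fusing.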

The future?

Extirpation

Very hard to say that taxon x was extirpated from this lake at this time

Most palaeo data are presence only

Possible with associated marks — abundance or biomass conditional upon the taxon being found

We don’t know (statistical) things about the taxa we don’t observe

Hard to put a probability on (e.g.) extirpation with these data

Repeat counts

But ecologists have been doing this kind of work for decades — occupancy modelling

Most methods require repeated sampling

What would that look like for palaeo?

Could we count same number of things but over \(n \geq 2\) different “samples”?
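A sketch of why repeated samples matter, using the basic single-season occupancy model with occupancy probability \(\psi\) and per-sample detection probability \(p\). The values of \(\psi\) and \(p\) are hypothetical; in practice both are estimated from the repeated samples.

```python
def site_likelihood(detections, n_reps, psi, p):
    """Likelihood of one detection history under a basic occupancy model.

    Either the taxon is present and detected `detections` times out of
    `n_reps`, or it is absent, which is only possible with no detections.
    """
    occupied = psi * p**detections * (1 - p) ** (n_reps - detections)
    unoccupied = (1 - psi) if detections == 0 else 0.0
    return occupied + unoccupied


def prob_present_given_no_detections(n_reps, psi, p):
    """P(taxon present | not found in any of `n_reps` repeat samples)."""
    missed = psi * (1 - p) ** n_reps
    return missed / (missed + (1 - psi))


PSI, P = 0.6, 0.5  # hypothetical occupancy and detection probabilities
for n_reps in (1, 2, 4):
    print(n_reps, round(prob_present_given_no_detections(n_reps, PSI, P), 3))
```

With a single sample, a non-detection leaves a sizeable chance the taxon was present but missed; each extra repeat shrinks that chance, which is what makes a probability statement about absence possible.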